Dissecting "Attention Is All You Need" Piece by Piece


2023-12-04 09:51 | Source: collected from the web | Views: 265

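My take (a.k.a. rambling)

- It is especially important to understand every detail inside the Transformer model.
- The attention mechanism was first applied to NLP in a 2015 paper, "Neural Machine Translation by Jointly Learning to Align and Translate". "Attention Is All You Need" is where the Transformer was first proposed, not where attention was first proposed. So I wonder whether "Transformer: Attention Is All You Need" would have been a better title.
- Part 1 of this post (sections 0 to 7) is a translation of the original paper; Part 2 (section 8) answers questions the paper raised for me.
- Some sentences and terms are left in the paper's original English (I really could not translate them well). If that bothers you, feel free to close the tab~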

update on 20201030

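I recommend studying this alongside Prof. Qiu Xipeng's book "Neural Networks and Deep Learning"; it is really well written ღ( ´･ᴗ･` )ღ

Part 1. Translation of "Attention Is All You Need"

0. Abstract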

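The dominant sequence transduction models [the Q&A part below explains what a sequence transduction model is] are based on complex RNNs or CNNs that include an encoder and a decoder. The best-performing models also connect the encoder and decoder through an attention mechanism [which also shows that attention had been proposed well before this paper]. We propose a new, simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models ["these" refers to the models trained for the two translation tasks; in essence they are the same architecture] to be superior in quality while being more parallelizable and requiring significantly less time to train. On the WMT 2014 English-to-German translation task our model achieves 28.4 BLEU, improving over the existing best results, including ensembles [explained in the Q&A part], by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training cost of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing [see the Q&A part for what this means] with both large and limited training data.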

1. Introduction

Recurrent neural networks, LSTM and gated recurrent networks in particular, have been firmly established as state-of-the-art approaches in sequence modeling and transduction problems such as language modeling and machine translation. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.

Recurrent models typically factor computation along the symbol positions of the input and output sequences [this sets up what follows: on the surface it describes recurrent models, but it really points at their weakness, namely sequential processing that cannot be parallelized]. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$.
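To make the sequential dependency concrete, here is a minimal sketch of a recurrence of this shape. The scalar update `tanh(w_h * h + w_x * x)` and the names `rnn_hidden_states`, `w_h` and `w_x` are assumptions chosen for illustration, not anything defined in the paper.

```python
import math

def rnn_hidden_states(inputs, w_h=0.5, w_x=1.0):
    """Toy scalar recurrence: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0                      # h_0, the initial hidden state
    states = []
    for x_t in inputs:           # step t cannot start before step t-1 has finished
        h = math.tanh(w_h * h + w_x * x_t)
        states.append(h)
    return states

states = rnn_hidden_states([1.0, 0.0, -1.0])
```

Each `h` depends on the one before it, so the loop over positions cannot be run in parallel; that is the constraint the next paragraph talks about.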

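This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance in the case of the latter. The fundamental constraint of sequential computation, however, remains.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences ["their" refers to the words in a sentence]. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.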

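2. Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as the basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between the positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention (described in section 3.2).

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention. In the following sections we describe the Transformer, motivate self-attention, and discuss its advantages.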

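3. Model Architecture

Most neural sequence transduction models have an encoder-decoder structure. Here, the encoder maps an input sequence of symbol representations $(x_1, ..., x_n)$ to a sequence of continuous representations $z = (z_1, ..., z_n)$. Given $z$, the decoder then generates an output sequence $(y_1, ..., y_m)$ of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.

The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder, shown in the left and right halves of Figure 1, respectively.

Figure 1: The Transformer - model architecture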

3.1 Encoder and Decoder Stacks

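Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model} = 512$.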
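The expression LayerNorm(x + Sublayer(x)) can be sketched for a single position vector as follows. This is a simplified illustration: the learned gain and bias of layer normalization and the dropout used in the paper are left out, and the helper names `layer_norm` and `sublayer_connection` are made up for this sketch.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer_connection(x, sublayer):
    """LayerNorm(x + Sublayer(x)) for a single position vector x."""
    y = sublayer(x)
    assert len(y) == len(x), "the residual sum needs matching dimensions"
    return layer_norm([a + b for a, b in zip(x, y)])

# Any function that keeps the dimension can stand in for the sub-layer.
out = sublayer_connection([1.0, 2.0, 3.0, 4.0], lambda v: [2.0 * t for t in v])
```

The dimension check is the reason every sub-layer keeps the same output width: the sum `x + Sublayer(x)` only makes sense when both sides have the same size.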

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i. [Isn't it an obvious weakness that a word can only use the words before it? Not necessarily: the context may already have been captured in the encoder.]

Note:

With N = 6, the resulting Transformer structure looks like this: [image]

Only the topmost encoder layer passes its output to the decoder layers.

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. [Note: "query" is singular in the original.]

Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention

3.2.1 Scaled Dot-Product Attention

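We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values. [Pay attention to what "value" means in different places: the value of a key-value pair, or a computed value?]

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The two most commonly used attention functions are additive attention and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.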
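The attention formula can be sketched in plain Python as below. Matrices are lists of row vectors, and the function names and the toy `Q`, `K`, `V` values are assumptions for illustration; a real implementation would use optimized matrix multiplication, as the paper points out.

```python
import math

def softmax(row):
    m = max(row)                                   # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    output, weights = [], []
    for q in Q:
        # One score per key: the dot product of the query with that key, scaled.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # The output row is the weighted sum of the value vectors.
        output.append([sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))])
    return output, weights

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, w = scaled_dot_product_attention(Q, K, V)
```

In this toy case the query matches the first key more closely, so the first value gets the larger weight and dominates the output.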

While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ [3]. We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.
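The paper's footnote gives the reasoning behind this suspicion. Here is a sketch of that argument, under the assumption that the components of $q$ and $k$ are independent random variables with mean 0 and variance 1:

```latex
q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \qquad
\mathbb{E}[q \cdot k] = 0, \qquad
\mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k
\quad\Longrightarrow\quad
\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
```

Under that assumption, with $d_k = 64$ the unscaled dot products have a standard deviation of 8, while the scaled ones stay at 1 regardless of $d_k$.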

3.2.2 Multi-Head Attention

Instead of performing a single attention function with $d_{model}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

$$\begin{aligned} \mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, ..., \mathrm{head}_h)W^O \\ \text{where } \mathrm{head}_i &= \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \end{aligned}$$

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use $d_k = d_v = d_{model}/h = 64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
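A quick piece of arithmetic makes the "similar cost" claim plausible. The sketch below counts projection weights only (biases ignored), using parameter count as a rough proxy for cost; that proxy is an assumption of this example, not a measurement from the paper.

```python
d_model, h = 512, 8
d_k = d_v = d_model // h                     # 64 per head

# Per head: W_i^Q and W_i^K are d_model x d_k, W_i^V is d_model x d_v.
per_head = d_model * d_k * 2 + d_model * d_v
multi_head_params = h * per_head + (h * d_v) * d_model   # plus the output projection W^O

# A single full-width head with its own Q, K, V and output projections.
single_head_params = 4 * d_model * d_model
```

Both counts come out the same, because the eight 64-wide heads together are exactly as wide as one 512-wide head.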

3.2.3 Applications of Attention in our Model

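The Transformer uses multi-head attention in three different ways:

- In the "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models.
- The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
- Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position.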

We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
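A minimal sketch of this masking, assuming a lower-triangular "allowed" pattern and made-up score values; the helper names `causal_mask` and `masked_softmax` are illustrative and not taken from any library.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]   # exp(-inf) is 0.0, so illegal positions get weight 0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# The same raw scores are used for every row, only to show the effect of the mask.
weights = [masked_softmax([0.3, 1.2, -0.5], row) for row in mask]
```

Row 0 can only look at position 0, row 1 at positions 0 and 1, and row 2 at everything, which is exactly the "up to and including that position" rule.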

3.3 position-wise Feed-Forward Networks

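In addition to attention sub-layers ["attention sub-layers" here should be read as a noun phrase], each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_{model} = 512$, and the inner layer has dimensionality $d_{ff} = 2048$. [It is really just a simple FFN, nothing special. I don't get why it has to be called position-wise.]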
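One way to read "position-wise" is that the same small network is applied to every position on its own, with no mixing between positions. The sketch below shows this with toy sizes ($d_{model} = 2$, $d_{ff} = 3$); the weight values are arbitrary numbers chosen for illustration.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one position vector x."""
    hidden = [max(0.0, sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
              for j in range(len(b1))]
    return [sum(hidden[j] * W2[j][k] for j in range(len(hidden))) + b2[k]
            for k in range(len(b2))]

# Toy sizes (d_model = 2, d_ff = 3); the paper uses 512 and 2048.
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]

sequence = [[1.0, 1.0], [2.0, -1.0]]
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]   # same weights at every position
```

The list comprehension over `sequence` is the whole point: each position is transformed independently, which is also why the layer can be seen as a convolution with kernel size 1.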

3.4 Embedding and softmax

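Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation. In the embedding layers, we multiply those weights by $\sqrt{d_{model}}$.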

3.5 positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings.

In this work, we use sine and cosine functions of different frequencies:

$$\begin{aligned} PE_{(pos, 2i)} &= \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \\ PE_{(pos, 2i+1)} &= \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \end{aligned}$$

where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
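The sinusoidal encoding can be computed directly from the two formulas. The function name `positional_encoding` and the small sizes used here are assumptions for illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)), PE[pos][2i+1] = cos(same angle)."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):          # the loop variable i already equals the paper's 2i
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=8)
```

The "linear function" remark follows from the angle-addition identities: for each sine/cosine pair with frequency $\omega$, the encoding at pos + k is the encoding at pos rotated by the fixed angle $k\omega$, for example $\sin(\omega(pos+k)) = \sin(\omega\,pos)\cos(k\omega) + \cos(\omega\,pos)\sin(k\omega)$.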

We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

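4. Why Self-Attention?

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_1, ..., x_n)$ to another sequence of equal length $(z_1, ..., z_n)$, with $x_i, z_i \in \mathbb{R}^d$, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata:

- One is the total computational complexity per layer.
- Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required. [What is a sequential operation? See the Q&A part later in this post.]
- The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.

As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translation, such as word-piece and byte-pair representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position. This would increase the maximum path length to $O(n/r)$. We plan to investigate this approach further in future work~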
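Table 1 of the paper, which this post does not reproduce, lists the per-layer complexity as $O(n^2 \cdot d)$ for self-attention and $O(n \cdot d^2)$ for a recurrent layer. The sketch below simply plugs numbers into those two expressions; constant factors are ignored, and the sentence length of 50 is an assumed example value.

```python
def self_attention_cost(n, d):
    return n * n * d      # O(n^2 * d): every position attends to every position

def recurrent_cost(n, d):
    return n * d * d      # O(n * d^2): one d x d matrix product per position

d = 512
short = (self_attention_cost(50, d), recurrent_cost(50, d))       # typical sentence, n < d
long_ = (self_attention_cost(1000, d), recurrent_cost(1000, d))   # long sequence, n > d
```

For the short sentence self-attention is about ten times cheaper, while for the long sequence the ordering flips, which is where the restricted neighborhood of size r comes in.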

A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions. Doing so requires a stack of $O(n/k)$ convolutional layers in the case of contiguous kernels, or $O(\log_k(n))$ in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of k. Separable convolutions [6], however, decrease the complexity considerably, to $O(k \cdot n \cdot d + n \cdot d^2)$. Even with k = n, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.

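As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

5. Training

This section describes the training regime for our models.

5.1 training data and batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.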

5.2 Hardware and schedule

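We trained our models on 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps, or 12 hours. For our big models (described in Table 3), step time was about 1 second. The big models were trained for 300,000 steps (3.5 days).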

5.3 Optimizer

We used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$. We varied the learning rate over the course of training, according to the formula:

$$lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})$$

This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000.
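The schedule is easy to check numerically. The function below is a direct transcription of the formula, with `step_num` assumed to start at 1 (at 0 the term `step_num ** -0.5` would be undefined).

```python
def lrate(step_num, d_model=512, warmup_steps=4000):
    """d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5); step_num starts at 1."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

peak = lrate(4000)      # the two arguments of min() meet at step_num == warmup_steps
halfway = lrate(2000)   # still in the linear warm-up phase
later = lrate(16000)    # inverse-square-root decay after the peak
```

With these defaults the rate rises linearly to its peak at step 4000 and then falls off; four times as many steps after the peak halves the rate.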

5.4 Regularization

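We employ three types of regularization during training.

- Residual Dropout
- Label Smoothing

Nothing much to add here; the paper spells everything out, so just study it~

6. Results

This part is still to be updated~

6.1 Machine Translation

6.2 Model Variations

6.3 English Constituency Parsing

7. Conclusion

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.

We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video.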

Making generation less sequential is another research goals of ours.

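The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor

That covers the content of "Attention Is All You Need". Next comes the most valuable part of this post: the Q&A.

Part 2. Questions and answers

8. Questions

1. What are sequence transduction models?

Simply put, they are models that do sequence transduction. So what is sequence transduction? See the figure below: [image: What is Sequence Transduction?]

Reference: https://www.cs.toronto.edu/~graves/seq_trans_slides.pdf

2. What does "English constituency parsing" in the excerpt below mean? [image]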

- There are two main techniques to analyze the syntactic structure of sentences: constituency parsing and dependency parsing

Constituency parsing is the task of breaking a text into sub-phrases, or constituents. Non-terminals in the parse tree are types of phrases, the terminals are the words in the sentence.

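English constituency parsing is simply constituency parsing applied to English. I won't show off in front of the experts; here is a link to a post by someone who knows this well: https://zhuanlan.zhihu.com/p/66268929

3. What does "including ensembles" in the excerpt below mean? [image]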

Ensemble methods have been called the most influential development in Data Mining and Machine Learning in the past decade. They combine multiple models into one usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges

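Simply put: an ensemble combines several models into one new model that draws on the strengths of each.

4. What does "typically factor computation" in the excerpt below mean? [image]

5. Why is $h_t$ in the excerpt below a function? That is, why would a sequence of hidden states be a function? [image]

6. What is an autoregressive model? [image]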

Put simply, an autoregressive model is merely a feed-forward model which predicts future values from past values.

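Let's look at an animation: [image]

Put even more simply, see section 6.1.2 of Prof. Qiu Xipeng's "Neural Networks and Deep Learning": [image] If the current time step is t, the value is written $y_t$, so $y_{t-k}$ should be read as the value at time step t-k.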
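Since the figure from the book is missing here, the following is a sketch of the usual textbook form of an autoregressive model of order K, followed by the analogous factorization for a decoder. It is not a copy of the missing figure; the symbols $w_k$, $K$ and $\epsilon_t$ (a noise term) are notation assumed for this sketch.

```latex
y_t = w_0 + \sum_{k=1}^{K} w_k \, y_{t-k} + \epsilon_t
\qquad\qquad
p(y_1, \dots, y_m \mid x) = \prod_{t=1}^{m} p(y_t \mid y_1, \dots, y_{t-1}, x)
```

In both cases the value at step t is predicted only from values at earlier steps, which is what "consuming the previously generated symbols" amounts to.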

References:

- https://arxiv.org/pdf/1308.0850.pdf
- https://eigenfoo.xyz/deep-autoregressive-models/

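7. What does point-wise mean? [image] I don't think the name means anything special; it's just a label~

8. In the excerpt below, what does the first "position" refer to? What is "this masking"? Why is a mask needed? Why offset by only one position? [image]

The mask is needed because this is the decoder (the generation process): at step i (how does this step i actually show up in code?), steps greater than i have no results yet and only steps smaller than i do, so masking is required. More on this later.

9. What is encoder-decoder attention? Why do the queries come from the decoder layer? [image] A figure explains this best. See the figure below. Each decoder layer has two multi-head attention sub-layers, and one of them is computed jointly with the encoder output, the part in the blue dashed box above. As marked below it, Q (the queries in the paper) comes out of the decoder, while K and V are taken from the encoder output. But why is it designed this way? I'm not sure yet.

10. What does "another sequence of equal length" in the excerpt below mean? Does it have to be as long as the original sequence? [image] This refers to comparing self-attention with the RNN and CNN layers that had commonly been used for equal-length mappings. The details (equal in what sense? are the RNNs in seq2seq still equal-length?) could be dug into further, but I don't think it's necessary~

11. What does "sequential operations required" refer to? [image] Sequential operations are commonly used to measure computational complexity and parallelism. The figure below lists some comparison criteria usually considered when designing an RNN.

12. What does "representation dimensionality" in the excerpt below mean? And what are "sentence representations"? [image] I still need to look into this, (。・＿・。)ﾉ I'm sorry~ [Writing a CSDN blog is a really poor experience: once there are images, the editor keeps jumping up and down while you type!]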

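13. What is the decoder's input (labelled "output embedding") in the figure below? [image] This expert from ByteDance wrote about it in detail, in section 1.3 of their post. Briefly, my understanding is:

- If this is the decoder's first run, the output embedding is that of the start-of-sequence marker (because we are about to predict the first word).
- If it is not the first run, the output embedding is the embedding sequence of the previously predicted results.

Here is an example: [image]

That raises another question, see below.

14. Why does the decoder still go time step 1 -> time step 2 -> ... -> time step n? Shouldn't it run once and output everything in one go?

The encoder can be computed in parallel, encoding everything in one pass. Decoding, however, does not produce the whole sequence at once; like an RNN it produces it one element at a time, because what was generated at the previous position feeds the attention query for the next one.

[image] Reference post: https://blog.csdn.net/weixin_37947156/article/details/90112176

15. How does the Transformer decoder implement time steps? For example, what is the "dropping down" operation in the blue box below? It makes the decoder feel somewhat recurrent, but I couldn't find anything in the source code implementing such a loop. [image] After discussing it with a friend, we reached the following conclusions:

- During training the decoder does not actually proceed time step by time step; that is, there is no for loop over the decoder. The "dropping down" operation is realized through the sequence mask. During training the current word cannot see the words after it, and training uses teacher forcing, which produces the "dropping down" effect.
- During training the decoder also runs in parallel. In principle the decoder's output is not produced in one go but built up step by step from the previous results. To process everything in parallel, masking is needed so that future information is not used ahead of time.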
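Here is a small sketch of how teacher forcing and the mask fit together during training. The marker strings `<s>` and `</s>`, the example sentence, and the helper name `teacher_forcing_pair` are hypothetical choices for illustration.

```python
BOS, EOS = "<s>", "</s>"     # hypothetical start/end markers

def teacher_forcing_pair(target_tokens):
    """Return (decoder_input, labels): the input is the target shifted right by one."""
    decoder_input = [BOS] + target_tokens
    labels = target_tokens + [EOS]
    return decoder_input, labels

dec_in, labels = teacher_forcing_pair(["ich", "bin", "hier"])

# With a causal mask, position i sees dec_in[:i + 1] and is trained to predict labels[i].
visible = [dec_in[: i + 1] for i in range(len(dec_in))]
```

All rows of `visible` can be processed in one parallel pass, because the ground-truth prefix is known in advance; the mask is what keeps each row from peeking at later tokens.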

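16. What are Q, K and V in attention? See the figure below: [image] Q, K and V are the three crucial matrices in attention, each formed by stacking q, k and v vectors. Every word has its own q, k and v vectors, so collecting those of all the words in a sentence gives three matrices. The main operation on these three matrices is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

17. Explain in detail what each step of Scaled Dot-Product Attention does [image]

- Q, K, V: covered above, not repeated here.
- MatMul: matrix multiplication, computing $QK^T$.
- Scale: because the dimension is fairly large, the dot products can become large, so they are scaled down by dividing by $\sqrt{d_k}$.
- Mask (opt.): this step is optional and is not used here.
- SoftMax: normalizes the scores, which are the entries of the scaled $QK^T$.
- MatMul: this matrix multiplication computes the product of $\mathrm{softmax}(QK^T/\sqrt{d_k})$ and $V$.

18. Explain in detail the operations in the Multi-Head Attention diagram below [image]

- V, K, Q: mentioned above, not explained again. I did have doubts about this input: it felt like the input should be a word-vector matrix X to be processed rather than V, K, Q. After checking some material, I concluded the paper should be trusted. The V, K, Q here are not obtained as $W^VX, W^KX, W^QX$ but are passed in from the encoder.
- Linear: applies a linear transformation [I checked some material, and several sources confirm that a linear transformation is applied]. But I don't understand why. After all, Q, K and V are themselves obtained through linear transformations. About this linear mapping, the paper says: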

we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional output values.

The question is why doing so is beneficial. The paper says:

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

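The figure below explains this linear step: [image]

- Scaled Dot-Product Attention: the thing we explained above. h is the number of heads, i.e. how many attention functions there are.
- Concat: each attention head produces a result, and here the results are concatenated into one complete vector.
- Linear: if the concatenated vector does not have the dimension we want, one more Linear operation converts it.

References:

- https://www.cnblogs.com/rosyYY/p/10115424.html
- https://blog.csdn.net/black_shuang/article/details/95384597

19. The decoder-side mask [image] Reference: https://zhuanlan.zhihu.com/p/104393915

20. What does "maximum path length" in the excerpt below refer to? [image] If you still have doubts about this, I think you should reread the answer to question 11. You may still be puzzled after that, so here is one more hint: I find the concept easier to understand in the context of RNNs~

21. What is greedy decoding?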

Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That’s one way to do it (called greedy decoding).

22. What is beam search?

Another way to do it would be to hold on to, say, the top two words (say, ‘I’ and ‘a’ for example), then in the next step, run the model twice: once assuming the first output position was the word ‘I’, and another time assuming the first output position was the word ‘a’, and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called “beam search”, where in our example, beam_size was two (meaning that at all times, two partial hypotheses (unfinished translations) are kept in memory), and top_beams is also two (meaning we’ll return two translations). These are both hyperparameters that you can experiment with.

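The "top two words" above are the two words with the highest probability. We keep the k most probable outputs at the current step, score each of these k candidates together with the outputs chosen before it, and keep the best-scoring ones; this repeats until decoding finishes. Note that at inference time there is no reference sentence to compute a loss against, so the "error" in the quote is best read as the model's own score, for example the negative log-probability of the partial translation.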
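The difference between the two strategies shows up even on a tiny example. In the sketch below, `toy_model` is a hypothetical stand-in for a trained decoder: it returns made-up next-token probabilities for a given prefix, and hypotheses are ranked by cumulative log-probability.

```python
import math

def toy_model(prefix):
    """Hypothetical stand-in for a decoder: next-token probabilities given a prefix."""
    table = {
        (): {"I": 0.6, "a": 0.4},
        ("I",): {"am": 0.5, "was": 0.5},
        ("a",): {"cat": 0.9, "dog": 0.1},
    }
    return table.get(tuple(prefix), {"</s>": 1.0})

def greedy_decode(model, steps):
    prefix = []
    for _ in range(steps):
        probs = model(prefix)
        prefix.append(max(probs, key=probs.get))   # keep only the single best token
    return prefix

def beam_search(model, steps, beam_size):
    beams = [([], 0.0)]                            # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = [(prefix + [tok], score + math.log(p))
                      for prefix, score in beams
                      for tok, p in model(prefix).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]             # keep the beam_size best partial hypotheses
    return beams

greedy = greedy_decode(toy_model, steps=2)
best_prefix, best_score = beam_search(toy_model, steps=2, beam_size=2)[0]
```

Greedy decoding commits to "I" because it is the most probable first word, and ends up with a sequence of probability 0.3. Beam search with two hypotheses keeps "a" alive and finds "a cat" with probability 0.36.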

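If you thought this was the end, that's far too naive. Do you think this is major XX? Without a hands-on part, how could it be CS? Below I walk through the Harvard NLP PyTorch implementation of attention in detail.

9. Source code

A quick look at the code:

EncoderDecoder class [image]

The forward() method of this class [image]

SublayerConnection [image]

There is simply too much to show here. If you want the details, please have a look at my GitHub. That repository (wheelsmaker) implements (or will implement) the papers I reproduce during my graduate studies; I hope it helps you~

10. References

I won't list every reference for this post here; I marked almost all of them in the text. Briefly:

- https://mp.weixin.qq.com/s/7RgCIFxPGnREiBk8PcxOBg
- http://nlp.seas.harvard.edu/2018/04/03/attention.html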
